Aims to provide a simple and high performance kernel for building coordination primitives in
the client
Provides per-client FIFO execution of requests and linearizability of all requests that
change the ZooKeeper state
The core idea: instead of implementing synchronisation primitives server-side, ZooKeeper
exposes an API and lets clients define their own
Exposed API manipulates simple wait-free data objects organised hierarchically
The API resembles a filesystem (or Chubby without the lock, open and close methods)
Exposed API can implement consensus for any number of processes
Pipelined architecture --> per-client FIFO for free
Guaranteeing FIFO enables clients to submit requests asynchronously:
Example: when an application elects a new leader, the leader has to update metadata. With
the updates queued asynchronously, initialisation is of sub-second order (faster than
issuing each update synchronously)
Zab --> Leader-based atomic broadcast protocol (used to totally order updates)
Read operations are not totally ordered
Clients cache data such as the current leader's id and use watches to learn of changes
(observer pattern):
Better than Chubby, since Chubby blocks updates until all caches holding the changed data
are invalidated
Chubby uses leases to bound the impact of slow/faulty clients, but ZooKeeper avoids the
problem altogether by letting clients manage their own caches
Only writes are linearizable
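Sketch of the asynchronous-submission point above: because requests from one client are
applied in FIFO order, the client can queue several updates without waiting for each reply.
This is a minimal illustration, assuming a single-worker executor as a stand-in for the
server pipeline; apply_update and the /meta/ paths are hypothetical names, not ZooKeeper
calls.

```python
from concurrent.futures import ThreadPoolExecutor

applied = []  # order in which the toy "server" applied the updates


def apply_update(path, value):
    # Hypothetical update handler; returns how many updates have been applied.
    applied.append((path, value))
    return len(applied)


# One worker drains its queue in submission order, standing in for a
# per-client FIFO pipeline.
with ThreadPoolExecutor(max_workers=1) as pipeline:
    # Submit everything first, without waiting for individual replies.
    futures = [pipeline.submit(apply_update, f"/meta/{i}", i) for i in range(5)]
    # Collect the replies afterwards; they arrive in submission order.
    results = [f.result() for f in futures]

print(results)
print(applied)
```

The client pays one round of waiting at the end instead of one per update, which is where
the initialisation speed-up comes from.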
The ZooKeeper service:
Overview:
Client -- user of the ZooKeeper service
Server -- process providing the ZooKeeper service
Znode -- in-memory data node in the ZooKeeper data tree (a hierarchical namespace)
Referring to a znode is done with standard UNIX path notation, e.g. /A/B/C
Znodes:
Regular -- created and deleted by client explicitly
Ephemeral -- Created explicitly, deleted explicitly or when the creating session
terminates
A sequential flag can be set when a znode is created. If set, the value of a
monotonically increasing counter is appended to the znode's name
Unlike files, znodes are not designed for general data storage
Znodes have associated metadata with timestamps and version counters (allows tracking
changes and conditional updates)
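Sketch of the znode types above, assuming a toy in-memory tree. ToyTree and its methods are
hypothetical names, not the ZooKeeper client API. The single global counter and the
zero-padded suffix are simplifications of this sketch.

```python
import itertools


class ToyTree:
    """Hypothetical in-memory stand-in for the data tree (not the real API)."""

    def __init__(self):
        self.nodes = {}                    # path -> (data, owning session or None)
        self._counter = itertools.count()  # simplification: one global counter

    def create(self, path, data=b"", session=None, ephemeral=False, sequential=False):
        if sequential:
            # Append a monotonically increasing counter value to the name.
            path = f"{path}{next(self._counter):010d}"
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = (data, session if ephemeral else None)
        return path

    def close_session(self, session):
        # Ephemeral znodes are removed when their creating session ends.
        for path in [p for p, (_, owner) in self.nodes.items() if owner == session]:
            del self.nodes[path]


tree = ToyTree()
tree.create("/app", b"config")  # regular znode: stays until deleted explicitly
a = tree.create("/app/member-", session="s1", ephemeral=True, sequential=True)
b = tree.create("/app/member-", session="s2", ephemeral=True, sequential=True)
tree.close_session("s1")        # removes only the znode created by s1
print(a, b, sorted(tree.nodes))
```

Combining the ephemeral and sequential flags like this is a common basis for group
membership: each member's znode has a unique, ordered name and vanishes with its session.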
Watches (Observer subscriptions):
Update clients in a timely manner and avoid polling
Read operations take a watch flag. When it is set, the server promises to notify the client
when the information it has just returned (as part of the read) changes
Watches are one-time triggers associated with a session (unregistered once triggered or
the session closes)
Watches indicate that a change has happened but do not provide the change
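Sketch of the one-time trigger semantics above, assuming a toy store with callback-based
watches. WatchedStore, get_data and set_data are hypothetical names for illustration only.

```python
class WatchedStore:
    """Hypothetical toy store illustrating one-time watch semantics."""

    def __init__(self):
        self.data = {}
        self.watches = {}  # path -> callbacks waiting for the next change

    def get_data(self, path, watch=None):
        # A read may register a watch on the data it returns.
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.data.get(path)

    def set_data(self, path, value):
        self.data[path] = value
        # Pop before notifying, so each watch fires at most once.
        for callback in self.watches.pop(path, []):
            callback(path)  # the event names the path, not the new value


store = WatchedStore()
store.set_data("/leader", "server-1")
events = []


def on_change(path):
    events.append(path)


store.get_data("/leader", watch=on_change)
store.set_data("/leader", "server-2")  # fires the watch and unregisters it
store.set_data("/leader", "server-3")  # no watch left, so no notification
latest = store.get_data("/leader", watch=on_change)  # re-read and re-register
print(events, latest)
```

Since the notification carries no data, the client has to read again to see the new value,
and that read is also where it re-registers the watch.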
Data Model:
Filesystem (key/value storage with hierarchical keys) with full data reads and writes
Hierarchical namespace is useful for distinguishing applications and setting access rights
Sessions:
Client connection initiates the session
Sessions have a timeout
Client is considered faulty if the service receives nothing from it within the timeout
window
Session ends when the client closes it explicitly or when the client is detected as faulty
Within a session the client observes a succession of state changes that reflect the
execution of its operations
Sessions enable clients to move transparently from one server to another (persist across
servers)
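Sketch of the timeout rule above, assuming an explicit clock passed in by the caller.
SessionTable, touch and expire are hypothetical names, and the timeout of 10 time units is
arbitrary.

```python
class SessionTable:
    """Hypothetical server-side session tracking with an explicit clock."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}  # session id -> time of the last message received

    def touch(self, session_id, now):
        # Any message from the client refreshes its session.
        self.last_heard[session_id] = now

    def expire(self, now):
        # Sessions silent for longer than the timeout are treated as faulty.
        dead = [s for s, t in self.last_heard.items() if now - t > self.timeout]
        for session_id in dead:
            del self.last_heard[session_id]
        return dead


table = SessionTable(timeout=10)
table.touch("s1", now=0)
table.touch("s2", now=0)
table.touch("s1", now=8)        # s1 keeps talking, s2 goes silent
expired = table.expire(now=12)  # s2 has been silent for 12 > 10 time units
print(expired, table.last_heard)
```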
Client API:
Allows creation, deletion and exists check on znodes
Get and set data stored in the znode
Get children of a znode
Sync: waits for all updates pending at the start of the operation to propagate to the
server that the client is connected to
All methods have both synchronous and asynchronous versions
ZooKeeper does not use handles to access znodes (every request uses the full path)
Each update method takes an "expected version number" --> enables conditional updates
(the update fails if the znode's current version does not match)
Expected version -1 --> no version check on update
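Sketch of conditional updates with an expected version, assuming a toy versioned store.
VersionedStore, BadVersion and increment are hypothetical names. The version starting at 0
on creation is an assumption of this sketch.

```python
class BadVersion(Exception):
    """Raised when the expected version does not match the current one."""


class VersionedStore:
    """Hypothetical toy store; each znode carries a version counter."""

    def __init__(self):
        self.nodes = {}  # path -> (data, version)

    def create(self, path, data):
        self.nodes[path] = (data, 0)

    def get_data(self, path):
        return self.nodes[path]

    def set_data(self, path, data, expected_version):
        _, version = self.nodes[path]
        # -1 means "skip the version check"; otherwise versions must match.
        if expected_version != -1 and expected_version != version:
            raise BadVersion(f"{path}: expected {expected_version}, found {version}")
        self.nodes[path] = (data, version + 1)
        return version + 1


def increment(store, path):
    # Optimistic read-modify-write: retry if another writer got in between.
    while True:
        data, version = store.get_data(path)
        try:
            return store.set_data(path, data + 1, version)
        except BadVersion:
            continue


store = VersionedStore()
store.create("/counter", 0)
v1 = increment(store, "/counter")
try:
    store.set_data("/counter", 99, expected_version=0)  # stale version
    stale_rejected = False
except BadVersion:
    stale_rejected = True
v2 = store.set_data("/counter", 99, expected_version=-1)  # unconditional write
print(v1, stale_rejected, v2, store.get_data("/counter"))
```

The version check gives clients a compare-and-set, which is what lets them build their own
primitives without server-side locks.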
Guarantees:
FIFO client order and linearizable writes (A-linearizability, i.e. asynchronous
linearizability -- a client may have multiple outstanding operations)
Read requests are processed locally at each replica
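Sketch of why local reads can be stale and how a sync-style call helps, assuming a toy
leader log and a follower that applies it lazily. Leader, Follower and catch_up are
hypothetical names, and the real sync is ordered through the server pipeline rather than
copying a log directly.

```python
class Leader:
    """Hypothetical leader holding the totally ordered log of writes."""

    def __init__(self):
        self.log = []  # list of (path, value) in agreed order

    def write(self, path, value):
        self.log.append((path, value))


class Follower:
    """Hypothetical replica that serves reads from its own local state."""

    def __init__(self, leader):
        self.leader = leader
        self.state = {}
        self.applied = 0  # number of log entries applied so far

    def catch_up(self, upto=None):
        upto = len(self.leader.log) if upto is None else upto
        for path, value in self.leader.log[self.applied:upto]:
            self.state[path] = value
        self.applied = max(self.applied, upto)

    def read(self, path):
        # Served locally: fast, but may lag behind the leader's log.
        return self.state.get(path)

    def sync(self):
        # Apply every write that was pending when sync was called.
        self.catch_up(len(self.leader.log))


leader = Leader()
follower = Follower(leader)
leader.write("/cfg", "v1")
follower.catch_up()
leader.write("/cfg", "v2")     # not yet applied at the follower
stale = follower.read("/cfg")  # local read returns the old value
follower.sync()
fresh = follower.read("/cfg")  # read after sync sees the pending write
print(stale, fresh)
```

Local reads are what make read throughput scale with the number of replicas; a client that
needs the latest value pays for a sync before reading.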